Method and arrangement for local synchronization in master-slave distributed communication systems

ABSTRACT

For switching or transmitting data packets, one can provide communication systems which consist of several modules—operating in parallel on segments of a packet—to increase speed and handling capacity. One module acts as master ( 21 ), the others are slave modules ( 22 ) controlled by control signals ( 25 ) derived by the master module. It is important that in each module the data segment and the respective control signal of each packet are correctly synchronized, because in large systems the data paths carrying packet segments and the control signal paths may have substantially different delays. The invention provides for measurement of the propagation delay differences and for introducing a controlled delay in each slave module, so that data segments and control signals can be correctly correlated by delaying either the one or the other. Synchronization packets are transmitted besides normal data packets, for obtaining time stamps which are used to determine the delay difference.

FIELD OF THE INVENTION

The present invention relates to packet handling technology in electronic communication networks and the design structure of the processing arrangements used therefor. More particularly, it refers to a method and apparatus for synchronizing multiple processing elements (or modules) operated in parallel, to form the equivalent of a single processing arrangement with an aggregate throughput equal to the sum of the throughputs of the multiple parallel processing elements. A typical application of this invention is in packet switching systems.

INTRODUCTION

Among all the competing requirements put on the switch fabric designs of the current generation, scalability of the number of ports and cost-effectiveness are two fundamental issues that should be addressed. Two ways to build a cost-effective and scalable switch fabric can be distinguished. The first option is the widely adopted single-stage switch architecture, which is very efficient but has scalability limits because of its quadratic complexity growth (as a result of linear growth of the number of ports). The second option is the multistage switch architecture, which provides higher throughput by means of more parallelism, but which is generally more complex and less efficient than single-stage switches.

A multistage switch architecture is also referred to as a Multistage Interconnection Network (MIN), i.e., a fabric arrangement of “small” single-stage switching modules interconnected via links in multiple stages or mesh-like in such a way that switching and link resources can be shared by multiple connections, resulting in a complexity growth smaller than N², typically in the order of N log N, where N is the total number of ports of the switch fabric. Although it is recognized that MINs are needed to obtain very high throughput and support for a large number of ports, their common introduction has been repeatedly postponed over the last decade. A reason is that continuous innovations in single-stage switching system design, together with new opportunities created by advances in underlying technologies, were able to keep pace with the market requirement increases over the same period. Also, within their range of scalability, single-stage switching architectures remain very attractive as they provide the most cost- and performance-effective way to build an electronic packet switch network.

Single-stage switch architectures can be classified into two types: architectures with centralized control and architectures with distributed control. The latter type consists of parallel switching domains, each having an independent scheduler (control domain). Its main drawback is that it requires some complexity overhead incurred by load balancing and reordering algorithms that handle the packets distributed over the multiple switch domains. In the literature, this is also referred to as Parallel Packet Switching (PPS). On the other hand, the switch architecture with centralized control has only one switch domain, which usually consists of several switch slices operated in parallel. Operating multiple switch slices in parallel enables an increase in switch port speed and thus allows a switching core with higher speed to be built. This approach is used in a number of single-stage switches as it allows systems handling large numbers of external links to be built by multiplexing them onto a single link of higher speed. For a given circuit technology, there is a limit to the applicability of this technique, but within its applicability range it offers the most cost-effective way to scale to larger sized switches. Other reasons that make single-stage switch designs based on the centralized control approach very popular are the singularity of its scheduling scheme and its ability to implement any queuing structure: shared-memory-based output-queued structure, crossbar-based input-queued structure or combined input-output-queued structure.

The problem addressed by the present invention applies to switch architectures with centralized control. The aim is to provide a means to improve their inherent growth limitation. This is done by facilitating the aggregation of multiple switch elements and having them operated in parallel in a so-called Port Speed Expansion mode. This improvement also indirectly applies to MIN architectures as they are usually composed of single-stage switching modules.

DESCRIPTION AND DISADVANTAGES OF PRIOR ART

In the computer community, data and pipeline parallelism have long been exploited to achieve higher bandwidth. When applied to packet switching technology in electronic networks, this translates into packets being switched over multiple parallel slices, and is sometimes referred to as Port Speed Expansion.

An early description of port speed expansion can be found in an article by W. E. Denzel, A. P. J. Engbersen, and I. Iliadis, entitled “A flexible shared-buffer switch for ATM at Gb/s rates”, published in Computer Networks and ISDN Systems, Vol. 27, No. 4, Jan. 1995, pp. 611-624. In this paper, port speed expansion is used to expand the port rate in a modular fashion by stacking multiple slave chips and having them controlled by a single master chip.

A particular port speed expansion embodiment applied to an output queued switch architecture is also described in the European patent application EP0849917A2.

The problem addressed by the present invention is now described in more detail. A well known difficulty of port speed expansion is the complexity of its implementation due to the fact that master and slave modules have to be tightly synchronized. At high port rates, this leads to complex and/or expensive synchronization logic which usually limits the physical degree of parallelism and thus the maximally achievable throughput. Therefore there is a need to decouple the scalability of a port speed expansion scheme from its implementation complexity incurred by synchronization issues.

In a switch fabric core operated in port speed expansion mode, the component switches are termed either “Master” or “Slave” switch. A port speed expanded switch fabric contains one Master and one or more Slave components. Master and Slaves may be connected in any arbitrary topology such as a chain, a ring, or a tree. The general concept of port speed expansion is now recalled with reference to FIG. 1, which illustrates an example related to the prior art commercial product IBM PRS64G where only one Slave is used. The PRS64G is a packet routing switch that implements 32 input and 32 output ports, each running at 2 Gb/s, for a total aggregate bandwidth of 64 Gb/s. Combining two of these chips in port speed expansion mode makes it possible to operate the physical ports at 4 Gb/s and to build a switch fabric with twice the aggregate bandwidth (128 Gb/s).

When a packet to be switched is received by the ingress fabric interface it is split into several parts, termed here “Logical Unit” (LU) (or later also termed “Segment”). In this particular example, the number of LU's equals the number of component switches, but this is not a prerequisite. Next, the ingress fabric interface sends one LU of each packet to the Master switch, and the following LU to the Slave switch. The first LU contains only part of the initial packet payload but it has the full packet header which includes handling information. The second LU, which is passed to the Slave, contains only payload information and no routing information.

The Master handles its LU according to the routing and Quality-of-Service information carried by the packet header, and then informs the Slave about its scheduling decision by sending appropriate (derived) control information to it. For every LU received by the Master, derived control information is sent to the Slave over a so-called ingress port speed expansion bus. Likewise, when the Master schedules a packet to be transmitted, similar control information is sent to the Slave over an egress port speed expansion bus. Because of the propagation delay of the egress control path, the master egress LU may actually leave earlier than the slave egress LU. In some cases, an additional transmit synchronization mechanism may be needed between the Master and the Slave, if the two outgoing LU's are required to reach the egress fabric interface at nearly the same time.

From the description above, it follows that a port speed expanded fabric calls for control of the propagation delays and a precise match of two different flows, namely: the data flow from ingress fabric interface toward fabric core and egress fabric interface (drawn horizontally in FIG. 1) and the control flow from Master to one or multiple Slaves (drawn vertically in FIG. 1). Given the packet duration example of FIG. 1 (128 ns for a 64-byte packet) and the compactness of the switch fabric core (built on a single board), this was easily achieved by ensuring that the control information reaches the Slave within one packet cycle of 128 ns, which is ample time for a single-board design in the current technology.
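The figures quoted for the PRS64G example can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function names are hypothetical, and it assumes that the 128 ns packet cycle corresponds to a 64-byte packet at the expanded port rate of 4 Gb/s.

```python
def aggregate_bandwidth_gbps(num_ports: int, port_rate_gbps: float) -> float:
    """Aggregate bandwidth of a fabric with num_ports ports at port_rate_gbps each."""
    return num_ports * port_rate_gbps


def packet_duration_ns(packet_bytes: int, port_rate_gbps: float) -> float:
    """Duration of one packet on a port; bits divided by Gb/s yields nanoseconds."""
    return packet_bytes * 8 / port_rate_gbps


# Single chip: 32 ports at 2 Gb/s each.
single = aggregate_bandwidth_gbps(32, 2.0)
# Two chips in port speed expansion mode: 32 ports at 4 Gb/s each.
expanded = aggregate_bandwidth_gbps(32, 4.0)
# Packet cycle for a 64-byte packet at the expanded port rate (assumption).
cycle = packet_duration_ns(64, 4.0)
print(single, expanded, cycle)
```

Under these assumptions the packet cycle is the time budget within which the derived control information has to reach the Slave in the single-board design.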

Meanwhile, because of the continuous increase in data link rates and system sizes, speed expanded systems have become progressively more difficult to build. On one side, the faster data link rates have caused packet durations to decrease but have required a higher degree of parallelism in the port speed expansion implementations. On the other side, bigger system sizes have forced designers to distribute the switch fabric over multiple boards and racks, thus increasing link distances for data flows and/or control flows within the fabric. Given all these stricter system requirements and sizes, it gets very difficult and/or expensive to precisely control and match the propagation delays between elements which are physically distributed and for which packet durations have decreased at the same time. In particular, it may occur that the multiple LU's from one packet do not arrive at the Master and one or more Slave switches at the same or close to the same time. In fact, it may occur that LU's from completely different packets arrive at the Master and/or the Slave switches at the same or nearly the same time.

Assuming a chain-based topology example of 1 Master and N−1 Slaves as depicted in FIG. 2, a possible solution is to provide each Slave with a means to measure the latency of the control path at system initialization time, and to insert a digital programmable delay into the data path of each Slave that compensates and matches the propagation delay of the control path. Measurement of the control path latency is done relative to a synchronization signal broadcast by the Master to all Slaves. Once the latency of the control path has been measured by each Slave, the digital programmable delay of the data path is set accordingly and individually within each Slave, so that the control and data path delays match on a packet cycle basis. Although this proposal goes in the right direction, it solves only half of the problem as it is not able to compensate for different latencies in the port speed expanded data paths (see Data Path Skew in FIG. 2). In fact, the proposed scheme only works if the system is rather tightly synchronized, such that all LU's sent by the ingress fabric interface reach the fabric core within a skew window which is less than a packet cycle duration. At a port rate in the order of 10 Gb/s (OC192), this may be achievable if the number of ports allows the physical fabric size to be built in a compact way of, say, a single electronic rack. For systems of larger dimensions and higher port rates such as 40 Gb/s (OC768), the local synchronization method not only should compensate for latency of the control path but should also compensate for the unpredictable skew in the propagation paths of both data and control information, and this for any (arbitrary) topology. Also, in order to be easily scalable, the method should be able to relax the synchronization constraints incurred by the port speed expansion concept.

SUMMARY OF THE INVENTION

Generally, the objective of this invention is to provide a method and apparatus to achieve local synchronization of data and control information at each module of a distributed master-slave communication system of arbitrary topology. Synchronization is achieved by compensating the unpredictable skew in the propagation paths of data and control information. The magnitude and sign of each compensation is determined by sending synchronization packets through the communication system.

Another objective is to provide a means to locally and independently measure the propagation delay difference between the data and the control paths at every synchronization point of the distributed system. This local measurement makes it possible to cope with the inherent speed scalability limits of distributed communication systems with centralized control, by enabling the system to operate in a locally synchronous but globally asynchronous fashion. The advantage of this scheme, as opposed to a global synchronization scheme of a master and multiple slaves, is that the centrally controlled system can be scaled to operate with a higher degree of parallelism, an arbitrary number of slaves and an arbitrary topology. In particular, it allows plesiochronous systems to be built that operate different modules with slightly different frequencies of slowly varying phases, which is usually the case in large distributed systems.

In accordance with the present invention, there is provided a communication system for processing data packets each including a header with control information and a data payload. The system comprises an ingress port for receiving the data packets, in which ingress port each data packet is subdivided into segments. The system further comprises a master unit and one or more slave units for parallel processing of the segments. The master unit is adapted to receive the header from each packet via a data path and the one or more slave units are adapted to receive data segments via a data path. Via a control path, derived control information is passable from the master unit to the one or more slave units. Synchronization providing means are provided in the system for sending synchronization packets, also subdivided into segments, from the ingress port through the system over the same paths as normal data packets, and for passing synchronization control information through the system over the same paths as normal derived control information. Each of the one or more slave units comprises time shift information means, also referred to as first means, for obtaining, when a synchronization packet segment and its corresponding synchronization control information are received, time shift information representing the propagation delay difference between the data path and the control path. Each of the one or more slave units comprises delay means, also referred to as second means, for delaying either a data segment or derived control information, in response to the time shift information obtained by the time shift information means.

In accordance with a second aspect of the present invention, there is provided a communication arrangement for processing data packets each including a header with control information and a data payload,

comprising an ingress port for receiving the data packets, in which ingress port each data packet is subdivided into segments,

comprising a communication system with a master unit and one or more slave units for parallel processing of the segments, wherein the master unit is adapted to receive the header with control information from each packet via a data path and the one or more slave units are adapted to receive data segments via a data path; and wherein derived control information is passed from the master unit to the one or more slave units via a control path,

in which arrangement

means, also referred to as synchronization providing means, are provided for sending synchronization packets, also subdivided into segments, from the ingress port through the system over the same paths as normal data packets, and for passing synchronization control information through the system over the same paths as normal derived control information,

each slave unit comprises first means for obtaining, when a synchronization packet segment and its corresponding synchronization control information are received, time shift information representing the propagation delay difference between the data path and the control path, and

each slave unit comprises second means for delaying either a data segment or derived control information, in response to the time shift information obtained by the first means.

In accordance with a third aspect of the present invention, there is provided a method for local synchronization in a master-slave communication system designed for processing data packets each comprising a header with control information and a data payload and each receivable through at least one ingress port, in which system each data packet is subdivided into segments in the ingress port for parallel processing of the segments;

the system comprising a master unit and one or more slave units for parallel processing of the segments; wherein the master unit receives the header with control information from each packet and the one or more slave units receive data segments via a data path; and wherein derived control information is passed from the master unit to the one or more slave units via a control path;

the method comprising the following steps, for ensuring correct correlation between received data segments and derived control information in the slave units despite differing propagation delays in the data path and the control path:

(a) sending a synchronization packet, also subdivided into segments, from the ingress port through the system over the same paths as normal data packets, and passing synchronization control information, derived from the header of the synchronization packet, through the system over the same paths as normal derived control information;

(b) obtaining in the one or more slave units, when a synchronization packet segment and its corresponding synchronization control information are received, time shift information representing the propagation delay difference between the data path and the control path; and

(c) in the one or more slave units, compensating for the propagation delay difference, represented by the time shift information obtained in step (b), by delaying for each received packet segment either the packet segment itself or the derived control information.

A particular advantage of this invention is that its synchronization scheme is locally self-adaptive and that it can be made robust. Self-adaptive means that the synchronization process is performed locally and autonomously at every synchronization point of the distributed system, and that no bi-directional communication is required between any modules of the communication system. Robustness to varying delays of data and/or control paths can be achieved by sending the synchronization packets multiple times through the system, for example at regular intervals.

A further advantage of this invention is that, since the master/slave segments can compensate for skew between packet segments, the ingress adapter source is not required to transmit all packet segments simultaneously. In fact, it is advantageous for it to send the packet segments transmitted to the master and the slaves at a time delayed by the time required for the previous master/slave on the control path to forward the control information to the following slave in the path plus the difference between the data path skew of those consecutive segments on the control path. Doing this decreases the amount of buffering required on the data path of the slaves to compensate for the control path latency. This advantage holds for single-stage systems or the first stage of a multistage communication system.

The advantage of the relaxed synchronization constraints per stage gives more design freedom for the master plane in both single-stage and multistage systems because the master plane is now temporally independent from the self-adapting slave planes. The advantage of local synchronization in multistage systems is that no extra latency is added by each stage, which would be the case if each stage were globally synchronized. Because multistage communication systems are also physically larger than single-stage systems (in identical technology), relaxed synchronization constraints become more important for such systems because the larger system may span multiple boards/shelves/racks that are connected through longer links. With ever increasing bandwidth/faster packet transmission times, the decoupling of synchronization constraints from packet lengths is an important advantage.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of examples and is not limited by the shape of the figures or the drawings in which:

FIG. 1 is a block diagram of a prior art electronic switching system that implements the general concept of a distributed master-slave communication system and port speed expansion.

FIG. 2 is a block diagram representing schematically the transmission of multiple data packets from an ingress source to a segmented communication system which is (arbitrarily) organized into a chain topology of 1 master and N−1 slaves.

FIG. 3 is a general schematic overview of a packet communication system (CS) and the corresponding segmented packet communication system which can be improved by the present invention.

FIG. 4 is a block diagram representing schematically the transmission of synchronization packet segments and of control information derived from the segment of the synchronization packet that contains the header, in accordance with the invention.

FIG. 5 is a block diagram of a preferred embodiment of the invention.

FIG. 6 is a block diagram representing schematically the transmission of multiple data packets from the egress side of a segmented communication system to an egress adapter (FIG. 6a) or to another segmented communication system in the case of a multistage configuration (FIG. 6b).

FIG. 7 is a flow diagram illustrating the synchronization procedure.

DETAILED DESCRIPTION OF THE INVENTION

With general reference to the figures and with special reference now to FIG. 3, a general communication system 30 that transports data entities, hereafter called packets, is considered. Beyond certain stringent requirements of size or performance, parallelism is sometimes the only feasible solution in any given technology.

Parallelism can be achieved by partitioning and distribution of the system. The combined functionality of the distributed parts 30-1 to 30-N is identical to the functionality of the original system 30. Therefore a packet 31 is also partitioned (into segments) and transported through the communication system by processing different parts of the packet in different parts of the system. Partitioning of the system and the packet is depicted in the lower part of FIG. 3. A typical example of such a parallel system is the case where M=N and each segment of a packet is processed by a corresponding part of the communication system.

Segmentation of the incoming packets is assumed to be done by an external device 33, hereafter called ingress adapter. Similarly, re-assembly of the outgoing packet segments is also assumed to be done by an external device 34, hereafter called egress adapter.

There are several ways to segment and distribute the functionalities of a communication system as mentioned in the introduction. The problem addressed by the present invention applies to distributed communication systems with centralized control, which are sometimes referred to as a master-slave class of system.

A master-slave class of system may be connected in any arbitrary topology such as a chain, a ring, a tree, or any combination of these three topologies. With reference now to FIG. 2, the ingress behavior of a distributed communication system with centralized control is explained relative to a chain-based topology, which is one possible embodiment among many others.

A key attribute of a distributed communication system with centralized control is that its internal links can operate at a much lower rate than the incoming external line rate. Assuming an external line rate R, a communication system can be composed of multiple (say N) modules operated in parallel, resulting in individual module links being operated at rate R/N.

Incoming packets are partitioned by an ingress adapter into N segments of identical size before being sent over N different links or connections 20-0, 20-1, . . . , 20-(N−1), each operating at rate R/N. The first segment containing the packet header (and possibly also payload) is sent to a master module 21, whereas the N−1 other segments containing only data payload are transmitted to a first, second, and further slave modules 22-1 to 22-(N−1). The highest achievable degree of parallelism is dictated by the size of the header, which must entirely fit into a single segment. Therefore N cannot be greater than the size of the packet divided by the size of the header. In the maximal expansion mode, the first segment does not carry any payload.
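The segmentation rule can be illustrated by a short sketch. It is not taken from the described embodiment: the function names are hypothetical, and it assumes for simplicity that the packet length is an exact multiple of N so that all segments have the same size.

```python
def max_parallelism(packet_size: int, header_size: int) -> int:
    """Upper bound on N: the header must fit entirely into one segment."""
    return packet_size // header_size


def segment_packet(header: bytes, payload: bytes, n: int) -> list[bytes]:
    """Split a packet into n equal-size segments; segment 0 carries the header.

    Assumption of this sketch: len(header) + len(payload) is a multiple of n.
    """
    packet = header + payload
    if n < 1 or len(packet) % n != 0:
        raise ValueError("packet length must be a positive multiple of n")
    seg_len = len(packet) // n
    if len(header) > seg_len:
        raise ValueError("header does not fit into a single segment")
    # Segment 0 goes to the master module, segments 1..n-1 to the slave modules.
    return [packet[i * seg_len:(i + 1) * seg_len] for i in range(n)]


header = b"HDR0"            # 4-byte header (illustrative)
payload = b"abcdefghijkl"   # 12-byte payload (illustrative)
segments = segment_packet(header, payload, max_parallelism(16, 4))
```

With a 4-byte header and a 16-byte packet, N = 4 is the maximal expansion mode of this example: the first segment consists of the header only and carries no payload.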

Although all segments are sent at the same time by the ingress adapter, different segments will experience different propagation times τ₀ to τ_(N-1), depending on the topology and the length and quality of the links. Therefore, the N segments 23-0 to 23-(N−1) of a given packet will generally not arrive at the master and the slaves at the same or close to the same time. The difference between the fastest and the slowest propagation time defines the data path skew window, which is assumed to be normalized to a packet cycle time for the sake of simplicity. It is also clear that for communication systems of very high bandwidth and/or large size, multiple packet segments from consecutive packets may be in flight over every single link or connection 20-0 to 20-(N−1).

When the master module 21 receives the segment 23-0, it extracts the header information and handles the segment according to the routing and Quality of Service (QoS) information (handling information) carried by the header. Next, or possibly at the same time, control information 24-0, hereafter called derived control information, is generated and transmitted to the slave module 22-1 over a control interface 25-0. The derived control information 24-0 informs the first slave module 22-1 about the control decision(s) made by the master module 21 and contains information required by the first slave module 22-1 to handle its incoming segment 23-1. Therefore, and similarly to the data links 20-0 to 20-(N−1), there will be multiple entities of derived control information in flight over an interface 25-k (0≦k≦N−2).

In the chain-based topology assumed by FIG. 2, the derived control information 24-0 received by the first slave module 22-1 is also forwarded to the second slave module 22-2 or the next in the chain and so on until the derived control information 24-(N−2) reaches the last slave module 22-(N−1). In a treelike topology, derived control information 24-0 could have been broadcast to all slaves at the same time.

Back to the topology example of FIG. 2, all derived control information 24-0 to 24-(N−2) may also experience dissimilar propagation delays δ₀ to δ_(N-2). In order for each slave to associate its segment 23-i (0≦i≦N−1) with the proper derived control information counterpart 24-j (0≦j≦N−2), a synchronization is useful between the data and control flows at each slave module 22-1 to 22-(N−1). This synchronization can be performed by introducing a programmable delay in the data and/or control paths, such that the differences between propagation delays of the paths can be compensated for. Practically, the compensation to be introduced by the first slave module 22-1 corresponds to the propagation delay δ₀ of the derived control information 24-0, minus the difference in propagation time between the links 20-0 and 20-1: (δ₀−(τ₀−τ₁)). The compensation for the second slave module 22-2, with respect to the derived control information sent by the master module 21, corresponds to ((δ₀+δ₁)−(τ₀−τ₂)), while it is ((δ₀+δ₁+ . . . +δ_(N-2))−(τ₀−τ_(N-1))) for the last slave module 22-(N−1) in the chain.
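The chain expressions can be evaluated mechanically for given delay values. The sketch below simply evaluates the expressions exactly as they are written in the preceding paragraph, including their sign convention; the function name and the sample delay values are hypothetical and are not measurements from any real system.

```python
from itertools import accumulate


def chain_compensations(tau: list[int], delta: list[int]) -> list[int]:
    """Evaluate (delta_0 + ... + delta_(k-1)) - (tau_0 - tau_k) for slaves k = 1..N-1.

    tau:   data path propagation times tau_0 .. tau_(N-1), master first
    delta: control path propagation delays delta_0 .. delta_(N-2) along the chain
    """
    if len(delta) != len(tau) - 1:
        raise ValueError("a chain of N modules has N-1 control hops")
    # cumulative[k-1] is the accumulated control delay from the master to slave k.
    cumulative = list(accumulate(delta))
    return [cumulative[k - 1] - (tau[0] - tau[k]) for k in range(1, len(tau))]


# Illustrative values in arbitrary time units (assumed, not from the source).
tau = [5, 3, 4]     # master, slave 1, slave 2
delta = [2, 1]      # master -> slave 1, slave 1 -> slave 2
compensations = chain_compensations(tau, delta)
```

In practice the individual τ and δ values are not known in advance, which is why the time shift is measured locally in each slave module rather than computed from such a table.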

In order to introduce a programmable delay in the data and/or control paths of each slave module 22-1 to 22-(N−1), the propagation delay difference is measured, i.e. time shift information representing this difference is obtained, and then the locally required compensation delay is computed. The latter is described in more detail below. It should be noted that, for the sake of coherence with the problem description above, the description remains in the context of a chain-based topology.

With reference now to FIG. 4, a feature of the invention is to inject special synchronization packets, hereafter termed sync packets, into the communication system and to locally measure (inside each slave module) the propagation delay difference between the control and data flows. This is done by obtaining time stamps for the data path and the control path, which represent the time shift between the two paths. Sync packets are distinguishable from the normal data stream and are injected by the ingress adapter 46 under the control of a specific process 47. Sync packets are also split into segments 43-0 to 43-(N−1) which are, in turn, distinguishable from the packet segments of normal data packets. In FIG. 4, this is indicated by the shaded packet segments.

The synchronization packets could be transmitted through the system periodically at regular intervals between normal data packets. But in certain cases it may be sufficient to send only one sync packet when the whole system is initialized, or to send sync packets (at irregular intervals) whenever it appears necessary.

When the master module 41 receives a sync packet segment 43-0, it generates specific control information 44-0, hereafter called derived sync control information, which it transmits to the first slave module 42-1 over the control interface 45-0, similar to the transmission of normal (non-sync) derived control information related to a data packet. Derived sync control information is distinguishable from normal derived control information and is also shaded in FIG. 4.

With reference to FIG. 5 and FIG. 7, matching of the data and derived control information within each slave module is described according to a preferred embodiment.

When a slave module receives derived control information over its ingress control interface 510, it does two things. First, it immediately forwards it over an egress control interface 520 to the next slave module in the chain. Secondly, it inspects the incoming control information with a sync control detector 534. If the incoming derived control information relates to a normal data packet, then it gets written into a first FIFO buffer 530. If the incoming derived control information is of type sync, it triggers the load of a control time-stamp register 533 with a sequence number provided by a sequencer 550 over a bus 551. In this preferred embodiment it is assumed that the derived sync control information also gets written into the first FIFO buffer 530, although this is optional.

The same kind of processing is applied to the incoming packet segments received over an ingress data interface 570. A sync packet detector 544 separates the normal data segments from the sync packet segments. Normal data packet segments are written into a second FiFo buffer 540, whereas sync packet segments trigger the load of a data time-stamp register 543 with the sequence number also provided by the sequencer 550. If derived sync control information is written into the first FiFo buffer 530, then sync packet segments are likewise written into the second FiFo buffer 540.

The sequencer 550 is basically a counter that is continuously incremented by the internal clock of the slave module. A Reset Logic 590 can force the sequencer 550 to restart counting from zero: it generates a reset command upon detection, by the detectors 544 and 534, of the first arrival of either a sync packet segment or its corresponding derived sync control information.
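The time-stamp capture described above can be sketched in a few lines of Python. This is only an illustrative model, not the circuit of FIG. 5: the class name `SlaveTimestamper`, the helper `measure`, and the idea of driving the model one clock cycle at a time are assumptions made for the example, and only a single sync packet is modelled.

```python
class SlaveTimestamper:
    """Toy model of sequencer 550, reset logic 590 and registers 533/543."""

    def __init__(self):
        self.sequencer = 0         # counter incremented by the local clock
        self.control_stamp = None  # control time-stamp register 533
        self.data_stamp = None     # data time-stamp register 543
        self._armed = True         # reset logic fires on the first sync arrival only

    def tick(self, sync_control=False, sync_data=False):
        """Advance one local clock cycle; flags mark sync arrivals in this cycle."""
        if (sync_control or sync_data) and self._armed:
            self.sequencer = 0     # reset command: restart counting from zero
            self._armed = False
        if sync_control:
            self.control_stamp = self.sequencer
        if sync_data:
            self.data_stamp = self.sequencer
        self.sequencer += 1


def measure(control_arrival, data_arrival):
    """Return (control_stamp, data_stamp) for arrivals given in clock cycles."""
    stamper = SlaveTimestamper()
    for cycle in range(max(control_arrival, data_arrival) + 1):
        stamper.tick(sync_control=(cycle == control_arrival),
                     sync_data=(cycle == data_arrival))
    return stamper.control_stamp, stamper.data_stamp
```

Because the sequencer restarts on the first arrival, the earlier path always records zero and the later path records the delay difference in local clock cycles.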

After transmission of a sync packet, a control program 580 (usually common to all master and slave modules) is used to monitor the content of the control and data time-stamp registers 533 and 543 via the bus 581. This control program computes the difference between the contents of the time-stamp registers and initializes the write pointer values 531 and 541 accordingly via respective buses 582 and 583. In this particular embodiment the FiFo's 530 and 540 are assumed to be used as circular shift registers, but a person skilled in the art can easily come up with other approaches to implement a programmable digital delay. Operating the FiFo buffers 530, 540 in a circular way means that once they are enabled via the respective buses 582 and 583, both read and write pointers start increasing at the same time (controlled by the internal clock) and the distance between the write and read pointers remains constant. This holds under normal mode of operation, which means a continuous flow of incoming data, idle and/or sync packets, and as long as no change in data and control path propagation delays is detected locally, after receipt of a sync packet, by the circuitry sketched in FIG. 5.
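The circular operation of a FiFo buffer as a programmable delay can be illustrated as follows. This is a minimal sketch under stated assumptions: the class name `CircularDelay`, the `clock` method, and the convention that the write happens before the read within one cycle are choices made for the example and are not taken from the figures.

```python
class CircularDelay:
    """Circular buffer whose write/read pointer distance sets a fixed delay."""

    def __init__(self, depth, write_pointer):
        if not 0 <= write_pointer < depth:
            raise ValueError("write pointer must lie inside the buffer")
        self.slots = [None] * depth
        self.write = write_pointer  # initial write pointer (531 or 541)
        self.read = 0               # read pointer (532 or 542) starts at zero

    def clock(self, item):
        """One clock cycle: store the incoming item, return the delayed one."""
        self.slots[self.write] = item
        delayed = self.slots[self.read]
        # Both pointers advance together, so their distance stays constant.
        self.write = (self.write + 1) % len(self.slots)
        self.read = (self.read + 1) % len(self.slots)
        return delayed
```

With an initial write pointer of 3, every item reappears three cycles after it was written; with a write pointer of 0, items pass through without delay.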

The setting of the read and write pointers is done in the following way. Read pointers 532 and 542 are always set to zero. The setting of the write pointers 531 and 541 is based on the numbers retrieved from the control and data time-stamp registers 533 and 543. If the control program 580 determines that the data segment is received in advance of its counterpart derived control information (i.e. {533}>{543}), then a delay is added into the incoming data path by initializing the data write pointer 541 with a value equal to the required delay. As the control path does not need to be delayed, the control write pointer 531 can be initialized with the same value as the read pointer, i.e. zero.

In the other case, when the control program determines that the control path is faster than the data path (i.e. {533}<{543}), a delay is added into the control path by initializing the control write pointer 531 with the required delay and setting the data write pointer 541 to zero.

The required delay is equal to the absolute value of the difference between the contents of time-stamp registers 533 and 543.
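The pointer-setting rule can be summarized in a short Python sketch. The function name `pointer_settings` and the dictionary keys are hypothetical and chosen only for readability; the numerals in the comments refer to the registers and pointers named in the text.

```python
def pointer_settings(control_stamp, data_stamp):
    """Derive initial pointer values from time-stamp registers 533 and 543."""
    delay = abs(control_stamp - data_stamp)  # required delay, in clock cycles
    settings = {
        "control_read": 0,   # read pointer 532, always zero
        "data_read": 0,      # read pointer 542, always zero
        "control_write": 0,  # write pointer 531
        "data_write": 0,     # write pointer 541
    }
    if control_stamp > data_stamp:
        # Data arrived first ({533} > {543}): delay the data path.
        settings["data_write"] = delay
    elif control_stamp < data_stamp:
        # Control arrived first ({533} < {543}): delay the control path.
        settings["control_write"] = delay
    return settings
```

For example, stamps of 7 (control) and 0 (data) give a data write pointer of 7 and leave the control write pointer at zero; equal stamps leave both paths undelayed.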

During normal mode of operation, the content of the control and data time-stamp registers 533 and 543 can also be monitored by the control program 580, or by any other hardware means implemented within the slave module, to verify that the distance between the two register values remains the same and that the system therefore remains synchronized. Another way to check that the system remains synchronized can be implicitly achieved inside an input port controller 560, when both sync packet segments and derived sync control information are written into the FiFo buffers 540 and 530. If this is the case, any sync packet segment read out of the second FiFo buffer 540 should always coincide with a derived sync control information read out of the first FiFo buffer 530; otherwise the system is no longer synchronized.
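The implicit check inside the input port controller can be pictured as a comparison of the two read-out streams. The sketch below is an assumption about how such a check might be expressed in software; the function name `still_synchronized` and the representation of each stream as a sequence of booleans (True where a sync item is read out) are hypothetical.

```python
def still_synchronized(data_is_sync, control_is_sync):
    """Check that sync segments and sync control information leave the
    FiFo buffers 540 and 530 in the same read cycles."""
    if len(data_is_sync) != len(control_is_sync):
        raise ValueError("both streams must cover the same read cycles")
    return all(d == c for d, c in zip(data_is_sync, control_is_sync))
```

A sync packet segment that appears one cycle away from its derived sync control information makes the function return False, which corresponds to the loss of synchronization described above.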

It is to be noted that the preferred embodiment is capable of delaying both the data and the control flows, even though it is expected that in most realistic applications the control path will be the slower path. The mechanism and the logic for delaying the control flow are not required if, by design, the data path skew window D_(skw) (defined as the maximum of the data skews between any of the packet segments associated with a given packet) is always smaller than the latency of any of the control paths between two consecutive slaves: D_(skw)<δ0, and D_(skw)<δ1, and . . . , and D_(skw)<δ_(N-2).
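The design condition above is a simple comparison against every control hop. As a hedged illustration, with a hypothetical function name and all values assumed to be expressed in the same time unit:

```python
def control_delay_logic_required(hop_latencies, data_skew_window):
    """Return False when D_skw is smaller than every control hop latency
    (delta_0 .. delta_(N-2)), so that only the data path ever needs delaying."""
    return not all(data_skew_window < latency for latency in hop_latencies)
```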

As mentioned earlier, sync packets can be sent periodically at regular intervals (which would normally be the case), only once at the beginning, or on demand.

With reference to FIG. 2 again, the interval between transmission of the sync packets is determined by the ingress adapter to be at least as long as the longest possible latency in the control transmission path, plus the maximum possible size of the data path skew window:

(max δ0 + max δ1 + . . . + max δ_(N-2)) + D_(skw)

All the numbers used to compute the minimum possible interval between transmission of two sync packets are easy to retrieve, as they correspond to absolute maximum values given by design. On the other hand, the only limit on the maximum possible interval between transmission of two sync packets is given by the maximum sequence range addressable by the sequencer 550 and the length of the FiFo buffers 530, 540.
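As a worked illustration of the minimum interval, the sum can be computed directly. The function name and the example figures are hypothetical; all values are assumed to share one time unit, for instance local clock cycles.

```python
def min_sync_interval(max_hop_latencies, data_skew_window):
    """Minimum spacing between two sync packets:
    (max delta_0 + ... + max delta_(N-2)) + D_skw."""
    return sum(max_hop_latencies) + data_skew_window
```

With three control hops of at most 5, 6 and 4 cycles and a skew window of 3 cycles, the ingress adapter would wait at least 18 cycles between two sync packets.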

It is also clear that the above requirement relates to the specific embodiment of FIG. 5 and that a person skilled in the art can easily imagine a further embodiment using another transmission rule for the sync packets.

With reference to FIG. 4 again, there are several methods for distinguishing sync packet segments 43-i (0≦i≦N−1) and derived sync control information 44-j (0≦j≦N−2) from the normal data packet segments and normal derived control information. A preferred method is to encode the packets and to use special code characters, such as the K-characters of the 8b/10b Fibre Channel/Ethernet/Infiniband code, to specifically mark the sync packet segments and the derived sync control information. Nevertheless, any other method which clearly distinguishes sync packets and derived sync control information from other packets would work as well.

With reference to FIG. 6, the egress part of a distributed communication system with centralized control is described for two different cases. FIG. 6 a shows a case where the communication system 600 is a single-stage system, or the last stage of a group of similar communication systems, and FIG. 6 b shows a case where the communication system 600 is only one stage of an arrangement with plural stages, and where another system 660 b follows as the next stage in the arrangement. Similar to the ingress side of the communication system 600, packets leaving the system are also partitioned into N identical segments and are sent over N different links or connections 640-0, 640-1, . . . , 640-_(N-1), each operating at rate R/N. The first segment containing the packet header (and possibly also payload) is transmitted by the master module 601, whereas the N−1 other segments containing only data payload are transmitted by the slave modules 602-1 to 602-_(N-1).

The egress part of the communication system 600 either connects to an egress adapter 660 a which reassembles the outgoing data segments into a single packet (FIG. 6 a), or to the ingress part of another communication system 660 b in the case of a multistage interconnect configuration (FIG. 6 b).

As different outgoing data segments will also experience different propagation times over the links 640-0 to 640-_(N-1) (and also on the control path 663-0 to 663-_(N-1) of the next stage in FIG. 6 b), a synchronization process similar to that of the ingress side of the communication system is also used between the egress side of the communication system 600 and the next block connected to it, i.e. the egress adapter 660 a or the next-stage communication system 660 b. This implies that the communication system 600 generates and injects special synchronization packet segments 650-0, 650-1, . . . , 650-_(N-1), which together represent one sync packet, over the links 640-0 to 640-_(N-1), in order for the next stage to locally measure the propagation delay differences and to adjust for them accordingly.

If the next stage is also a communication system of the master-slave class (FIG. 6 b), then the sync packet segments 650-0 to 650-_(N-1) generated by the egress side of the communication system 600 are used by the ingress side of the next stage 660 b to achieve local synchronization of the data and derived control information, as described previously. If the next stage is an egress adapter (FIG. 6 a), then the sync packet segments 650-0 to 650-_(N-1) generated by the egress side of the communication system 600 are used to measure the relative arrival time between the multiple packet segments in order to recombine them into a single packet that can be further processed and/or forwarded.
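For the egress adapter case, one plausible way to use the measured relative arrival times is to hold each segment until the slowest link has delivered. The text does not prescribe this rule, so the following is only a sketch under that assumption; the function name and the use of one arrival stamp per link are hypothetical.

```python
def alignment_delays(sync_arrival_stamps):
    """Per-link hold time that lines every segment up with the slowest link.

    sync_arrival_stamps holds one arrival time per link 640-0 .. 640-(N-1),
    measured when the sync packet segments 650-0 .. 650-(N-1) arrive.
    """
    latest = max(sync_arrival_stamps)
    return [latest - stamp for stamp in sync_arrival_stamps]
```

For arrival stamps of 3, 7, 5 and 7 the hold times would be 4, 0, 2 and 0, after which the N segments of a packet are available together for recombination.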

In both cases (FIG. 6 a and FIG. 6 b) the egress side of the communication system 600 behaves as an ingress adapter for the next stage attached to it.

There are several methods for defining the injection time of the egress sync packet segments 650-0 to 650-_(N-1). A preferred method is to derive the injection time from the incoming sync packet segments 610-0 to 610-_(N-1), while another method would be to derive the injection time directly from a specific egress process 604.

The first option is most likely to be used by a bufferless system in which incoming packets are immediately forwarded to an output port without being stored. In that particular case, an egress sync packet segment can be generated whenever a match between a sync packet segment and a derived sync control information occurs in the input port controller 560 (FIG. 5).

On the other hand, if the communication system 600 is a buffered system, ingress and egress sync processes are most likely decoupled from each other. In that particular case the sync packet segments can be regenerated by the communication system itself if it implements a specific egress sync process 604. When this process triggers the injection of one sync packet, one sync packet segment 650-0 is generated by the master module 601 and transmitted over the link 640-0. At the same time, a derived sync control information, called derived egress sync control information, is also transmitted to all the slave modules 602-1 to 602-_(N-1) over the control interface 603-0 to 603-_(N-2). Within each slave module 602, the derived egress sync control information is then used locally to regenerate an egress sync packet segment to be transmitted over the links 640-0 to 640-_(N-1). Another case that calls for decoupling the ingress and egress sync processes is when the delays on the egress control path differ from the delays on the ingress control path.

It is to be noted that FIGS. 2, 4 and 6 show a single control path from master to slaves. This does not exclude the possibility of having multiple distinct control paths. A typical example is depicted in FIG. 1, where ingress and egress control paths are separate.

Any disclosed embodiment may be combined with one or several of the other embodiments shown and/or described. This is also possible for one or more features of the embodiments.

1. A communication system for processing data packets, each packet including a header with control information and a data payload, the system comprising: an ingress port for receiving said data packets, in which ingress port each data packet is subdivided into segments; a master unit and one or more slave units for parallel processing of said segments, the master unit is adapted to receive the header from each packet via a data path and the one or more slave units are adapted to receive data segments via a data path, via a control path derived control information is passable from the master unit to the one or more slave units; in which system, (a) synchronization providing means are provided for sending synchronization packets also subdivided into segments from the ingress port through the system over the same paths as normal data packets, and for passing synchronization control information through the system over the same paths as normal derived control information; (b) each of the one or more slave units comprises time shift information means for obtaining, when a synchronization packet segment and its corresponding synchronization control information are received, time shift information representing the propagation delay difference between the data path and the control path; and (c) each of the one or more slave units comprises delay means for delaying either a data segment or derived control information, in response to said time shift information obtained by said time shift information means.
 2. Communication arrangement for processing data packets each packet including a header with control information and a data payload, said communication arrangement comprising an ingress port for receiving said data packets, in which ingress port each data packet is subdivided into segments; comprising a communication system with a master unit and one or more slave units for parallel processing of said segments, the master unit is adapted to receive the header with control information from each packet via a data path and the one or more slave units are adapted to receive data segments via a data path; and wherein derived control information is passed from the master unit to the one or more slave units via a control path; in which arrangement, (a) means are provided for sending synchronization packets also subdivided into segments from the ingress port through the system over the same paths as normal data packets, and for passing synchronization control information through the system over the same paths as normal derived control information; (b) each slave unit comprises first means for obtaining, when a synchronization packet segment and its corresponding synchronization control information are received, time shift information representing the propagation delay difference between the data path and the control path; and (c) each slave unit comprises second means for delaying either a data segment or derived control information, in response to said time shift information obtained by said first means.
 3. Communication arrangement as in claim 2, characterized in that each of the one or more slave units comprises, in connection with said first means for obtaining said time shift information, a sequencer in the form of a counter, whose content increases in response to local clock pulses of the respective slave unit.
 4. Communication arrangement as in claim 3, characterized in that each of the one or more slave units comprises, in said first means for obtaining said time shift information, (a) a control time stamp register for storing the contents of the sequencer when synchronization control information derived from a synchronization packet is received via the control path, and (b) a data time stamp register for storing the contents of said sequencer when a synchronization packet segment is received via the data path.
 5. Communication arrangement as in claim 4, characterized in that internal or external control means are provided for evaluating the contents of the time stamp registers in one of the one or more slave units, and for determining the difference representing the time shift, and that each of the one or more slave units comprises in said second means, (a) separate delay means for delaying packet data segments and for delaying derived control information, and (b) activating means for selectively activating a delay in either one of these delay means in response to said control means.
 6. Communication arrangement as in claim 5, characterized in that in each slave unit (a) said delay means comprise circular shift registers controlled by write and read pointers, and (b) said activating means include write pointer registers and read pointer registers, and setting means to set the contents of one of the write pointer registers (541; 531) to a delay value representing the difference between the contents of said two time stamp registers, and to set the contents of the respective other write pointer register and of both read pointer registers to zero.
 7. Communication arrangement as in claim 3, characterized in that each of the one or more slave units comprises means for resetting said sequencer in response to a first receipt of a synchronization packet segment or the corresponding derived synchronization control information.
 8. Communication arrangement in accordance with claim 2, further comprising at least one egress port for reassembling said segments to form data packets, characterized in that additional means are provided for compensating propagation delay differences on the paths between the master and slave units of the communication system on one hand, and the egress port on the other hand.
 9. Communication arrangement as in claim 2, comprising at least two distributed communication systems, characterized in that additional means are provided for compensating propagation delay differences on the data paths between the egress lines of the modules of one distributed communication system and the ingress lines of the modules of the following distributed communication system, and the control path.
 10. A method for local synchronization in a master-slave communication system designed for processing data packets, each packet comprising a header with control information and a data payload and each receivable through at least one ingress port, in which system each data packet is subdivided into segments in said ingress port for parallel processing of said segments, said system comprising a master unit and at least one slave unit for parallel processing of said segments; wherein the master unit receives the header with control information from each packet and the one or more slave units receive data segments via a data path; and wherein derived control information is passed from the master unit to the one or more slave units via a control path; the method comprising the following steps: (a) sending a synchronization packet, also subdivided into segments, from said ingress port through the system over the same paths as normal data packets, and passing a synchronization control information derived from the header of the synchronization packet, through the system over the same paths as normal derived control information; (b) obtaining in the one or more slave units, when a synchronization packet segment and its corresponding synchronization control information are received, time shift information representing the propagation delay difference between the data path and the control path; and (c) in the one or more slave units, compensating for said propagation delay difference, represented by the time shift information obtained in step (b), by delaying for each received packet segment either the packet segment itself or the derived control information.
 11. The method as in claim 10, wherein synchronization packets are transmitted from the ingress port through the system at regular intervals between normal data packets.
 12. A computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing processing of data packets each including a header with control information and a data payload, the computer readable program code means in said computer program product comprising computer readable program code means for causing a computer to effect the functions of: an ingress port for receiving said data packets, in which ingress port each data packet is subdivided into segments; and a master unit and one or more slave units for parallel processing of said segments, the master unit is adapted to receive the header from each packet via a data path and the one or more slave units are adapted to receive data segments via a data path, via a control path derived control information is passable from the master unit to the one or more slave units; in which system (a) synchronization providing means are provided for sending synchronization packets also subdivided into segments from the ingress port through the system over the same paths as normal data packets, and for passing synchronization control information through the system over the same paths as normal derived control information; (b) each of the one or more slave units comprises time shift information means for obtaining, when a synchronization packet segment and its corresponding synchronization control information are received, time shift information representing the propagation delay difference between the data path and the control path; and (c) each of the one or more slave units comprises delay means for delaying either a data segment or derived control information, in response to said time shift information obtained by said time shift information means.
 13. A computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing functions of a communication arrangement for processing data packets each including a header with control information and a data payload, the computer readable program code means in said computer program product comprising computer readable program code means for causing a computer to effect: an ingress port for receiving said data packets, in which ingress port each data packet is subdivided into segments; and a communication system with a master unit and one or more slave units for parallel processing of said segments, the master unit is adapted to receive the header with control information from each packet via a data path and the one or more slave units are adapted to receive data segments via a data path; and wherein derived control information is passed from the master unit to the one or more slave units via a control path; in which arrangement (a) means are provided for sending synchronization packets also subdivided into segments from the ingress port through the system over the same paths as normal data packets, and for passing synchronization control information through the system over the same paths as normal derived control information; (b) each slave unit comprises first means for obtaining, when a synchronization packet segment and its corresponding synchronization control information are received, time shift information representing the propagation delay difference between the data path and the control path; and (c) each slave unit comprises second means for delaying either a data segment or derived control information, in response to said time shift information obtained by said first means.
 14. A computer program product as recited in claim 13, the computer readable program code means in said computer program product further comprising computer readable program code means for causing a computer to effect the function characterized in that each of the one or more slave units comprises, in connection with said first means for obtaining said time shift information, a sequencer in the form of a counter, whose content increases in response to local clock pulses of the respective slave unit.
 15. A computer program product as recited in claim 14, the computer readable program code means in said computer program product further comprising computer readable program code means for causing a computer to effect the function characterized in that each of the one or more slave units comprises, in said first means for obtaining said time shift information, (a) a control time stamp register for storing the contents of the sequencer when synchronization control information derived from a synchronization packet is received via the control path, and (b) a data time stamp register for storing the contents of said sequencer when a synchronization packet segment is received via the data path.
 16. A computer program product as recited in claim 15, the computer readable program code means in said computer program product further comprising computer readable program code means for causing a computer to effect the function characterized in that internal or external control means are provided for evaluating the contents of the time stamp registers in one of the one or more slave units, and for determining the difference representing the time shift, and that each of the one or more slave units comprises in said second means, (a) separate delay means for delaying packet data segments and for delaying derived control information, and (b) activating means for selectively activating a delay in either one of these delay means in response to said control means.
 17. A computer program product as recited in claim 16, the computer readable program code means in said computer program product further comprising computer readable program code means for causing a computer to effect the function characterized in that in each slave unit (a) said delay means comprise circular shift registers controlled by write and read pointers, and (b) said activating means include write pointer registers and read pointer registers, and setting means to set the contents of one of the write pointer registers to a delay value representing the difference between the contents of said two time stamp registers, and to set the contents of the respective other write pointer register and of both read pointer registers to zero.
 18. An article of manufacture comprising a computer usable medium having computer readable program code means embodied therein for causing local synchronization in a master-slave communication system designed for processing data packets each comprising a header with control information and a data payload and each receivable through at least one ingress port, in which system each data packet is subdivided into segments in said ingress port for parallel processing of said segments, said system comprising a master unit and one or more slave units for parallel processing of said segments; wherein the master unit receives the header with control information from each packet and the one or more slave units receive data segments via a data path; and wherein derived control information is passed from the master unit to the one or more slave units via a control path, the computer readable program code means in said article of manufacture comprising computer readable program code means for causing a computer to effect the steps of: (a) sending a synchronization packet, also subdivided into segments, from said ingress port through the system over the same paths as normal data packets, and passing a synchronization control information derived from the header of the synchronization packet, through the system over the same paths as normal derived control information; (b) obtaining in the one or more slave units, when a synchronization packet segment and its corresponding synchronization control information are received, time shift information representing the propagation delay difference between the data path and the control path; and (c) in the one or more slave units, compensating for said propagation delay difference, represented by the time shift information obtained in step (b), by delaying for each received packet segment either one of the packet segment itself or the derived control information.
 19. An article of manufacture comprising a computer usable medium having computer readable program code means embodied therein for causing local synchronization in a master-slave communication system designed for processing data packets, the computer readable program code means in said article of manufacture comprising computer readable program code means for causing a computer to effect the steps of claim 11.
 20. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for local synchronization in a master-slave communication system designed for processing data packets, said method steps comprising the steps of claim 11.